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Abstract 



The use of L\ regularisation for sparse learn- 
ing has generated immense research inter- 
est, with many successful applications in di- 
verse areas such as signal acquisition, im- 
age coding, genomics and collaborative fil- 
tering. While existing work highlights the 
many advantages of L\ methods, in this pa- 
per we find that L\ regularisation often dra- 
matically under-performs in terms of predic- 
tive performance when compared to other 
methods for inferring sparsity. We focus on 
unsupervised latent variable models, and de- 
velop L± minimising factor models, Bayesian 
variants of "Li", and Bayesian models with 
a stronger L -like sparsity induced through 
spike-and-slab distributions. These spike- 
and-slab Bayesian factor models encourage 
sparsity while accounting for uncertainty in 
a principled manner, and avoid unnecessary 
shrinkage of non-zero values. We demon- 
strate on a number of data sets that in prac- 
tice spike-and-slab Bayesian methods out- 
perform L\ minimisation, even on a com- 
putational budget. We thus highlight the 
need to re-assess the wide use of L\ meth- 
ods in sparsity-reliant applications, particu- 
larly when we care about generalising to pre- 
viously unseen data, and provide an alterna- 
tive that, over many varying conditions, pro- 
vides improved generalisation performance. 
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1. Introduction 

Over the last decade, there has been tremendous 
excitement in learning parsimonious models using 
sparsity. Sparse learning is now a significant research 
topic - this significance being tied to the theoretical 
and practical advancement of sparse learning methods 
using the L\ norm. The use of the L\ norm in 
penalised regression problems such as the Lasso 



(Tibshirani 1996), in natural scene understanding 



and image coding problems (Olshausen and Field 



1996), and more recently in compressed sensing 



(Candes 20061, has served to cement the importance 
and efficacy of the L\ norm as a means of inducing 
sparsity. Among its important properties, the L\ 
norm is the closest convex norm to the Lq norm, 
has a number of provable properties relating to the 



optimality of solutions and oracle properties (van de 



Geer and Buhlmann 2009), and allows for the wide 



array of tools from convex optimisation to be used in 
computing sparse solutions. With the use of sparse 
methods in increasingly diverse application domains, 
it is timely to now contextualise the use of the L\ 
norm and critically evaluate its behaviour in relation 
to other competing methods. 

To achieve sparsity, the idealised but intractable 
sparsity criterion uses the Lq norm to penalise the 
number of non-zero parameters. To more closely 
match the Lq objective function, we develop here the 
use of discrete mixture priors for sparse learning, com- 



monly referred to as spike-and-slab priors ( 


Mitchell 


and Beauchamp |1988| Ishwaran and Rao 




2005). 



A spike-and-slab is a discrete mixture of a point 
mass at zero (the spike) and any other continuous 
distribution (the slab). It is is similar to the Lq 
norm in that it imposes a penalty on the number 
of non-zero parameters in a model. We show that 
spike-and-slab distributions provide improvements in 
learning, and that both Bayesian methods and the 
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use of the spike-and-slab distribution deserve more 
prominent attention in the vast literature for sparse 
modelling. 

Our analysis focuses on unsupervised linear latent 
variable models (also known as matrix completion 
models), a class of models that are amongst the core 
tools in the machine learning practitioner's toolbox. 
Factor analysis, the inspiration for this class of mod- 
els, describes real-valued data by a set of underlying 
factors that are linearly combined to explain the 
observed data. This base model allows for many 
adaptations, such as generalisations to non-Gaussian 




data 
or in 


Collins et al. 


2002 Mohamed et al. 




2008), 


earning sparse underlying factors ( 


Dueck and 


Frey 




2004 


Lee et al. 


2009 


Carvalho et al. 




2008). 



In unsupervised learning, a sparse representation 
is desirable in situations where: 1) there are many 
underlying factors that could explain the data, 2) 
only a subset of which explain the data, and 3) the 
subset is different for each observation. 

After introducing our framework for unsuper- 
vised models (section [2| , we develop approaches for 
sparse Bayesian learning, culminating in a thorough 
comparative analysis. Our contributions include: 

• We introduce new generalised latent variable models 
with strong sparsity, providing an important new 
class of sparse models that can readily handle non- 
Gaussian and heterogeneous data sets (sect. [4]). 

• We develop a spike-and-slab model for sparse unsu- 
pervised learning and derive a full MCMC algorithm 
for it. This MCMC method is applicable to other 
models based on discrete-continuous mixtures and 
is more efficient than naive samplers (sect. |3j). 

• We present the first comparison of approaches for 
sparse unsupervised learning based on optimisa- 
tion methods, Bayesian methods using continuous 
sparsity-favouring priors, and Bayesian methods us- 
ing the spike-and-slab. We bring these methods 
together and compare their performance in a con- 
trolled manner on both benchmark and real world 
data sets across a breadth of model types (sect. [6]). 

• Interestingly, our results show that strong sparsity 
in the from of spike-and-slab models can outper- 
form the commonly used L\ methods in unsuper- 
vised modelling tasks. 

2. Unsupervised Latent Variable 
Models and Sparsity 

We are concerned with models of the form: 

X = V0 + E, e„~AA(0,£), (1) 

which is the matrix factorisation problem in which 
we search for a set of underlying factors V and 



Figure 1. Graphical representation for generalised latent 
variable models. 

weights that are combined to explain the observed 
data X. We often consider Gaussian latent vari- 
ables and Gaussian noise with diagonal or isotropic 
covariances, in which case this model recovers the 
familiar factor analysis and principal components 
analysis models, respectively. If V is sparse then 
subsets of the underlying factors explain the data 
and different subsets explain each observed data point. 

Increasingly we do not deal with real-data, which is 
well described by a Gaussian distribution, but data 
that may be binary, categorical, non-negative or a 
heterogeneous set of these. It is interesting to then 
consider generalisations of the basic model ([I]) in 
which the conditional probability of the observed 
data is defined using the exponential family of 
distributions, as: 

x n |v n , ~ Expon (y~] v nk e^j ; 6 k ~ Conj (A, v) (2) 

We use the shorthand x„ ~ Expon (ip) to repre- 
sent the exponential family of distributions with 
natural parameters xp — v„0. For this model, 
the natural parameters are a sum of the parame- 
ters weighted by v nk , the points in the latent 
subspace corresponding to data point x n . For the 
exponential family of distributions, the conditional 
probability of x n given parameter vector xjj takes 
the form: p(x n \ip) = h(x n ) exp (s(x„) t t/> - A(x/>)) , 
where s(x„) are the sufficient statistics, ip is a vector 
of natural parameters and A(xp) is the log-partition 
function. Probability distributions that belong to the 
exponential family also have natural conjugate prior 
distributions, which we use to model the distribution 
of the parameters 0. We use the notation: Conj (A, v) 
as shorthand for the conjugate distribution, which 
has the form: p(8 k ) cx exp(A T 0j, — vA{6jA), with 
hyperparameters A and is, and A(6k) is the same 
log-partition function from the likelihood function. 

Figure ffl is a graphical representation of general 
unsupervised models; the shaded node x„ represents 
the observed data item n. The plate notation rep- 
resents replication of variables and the dashed node 
cp represents any appropriate prior distribution for 
the latent variables v„. The observed data forms an 
N x D matrix X, with rows x„. N is the number 
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of data points and D is the number of observed 
dimensions. is a K x D matrix with rows 0^. 
V is an N X K matrix with rows v„, which are 
if-dimensional vectors, where K is the number of 
latent factors. 

The K latent variables for each data point are 
generally assumed to be independent a priori: 
v„ ~ n*=i S{v n k\i^>), where S is the prior on each 
variable with hyperparameters <p (figure [lj. The 
prior distribution S(v„k) can be of any type. If 
the exponential family is Gaussian and we use 
Gaussian latent variables, we recover factor analysis; 
general exponential families corresponds to the well 
known exponential family PCA models (EPCA) 
(ICollins et all 120021 IMohamed et all 120081). Con- 



(|Mitchell and Beauchamp 1988 Ishwaran and Rao 



sidering non-Gaussian latent variables instantiates 
models such as ICA or the relevance vector machine 
(RVM) ( |Levin et al.| |2009j |Wipf and Nagarajan[ [2008| 



Unsupervised models with sparsity are obtained 
by employing sparsity- favouring distributions. A 
sparsity-favouring distribution can be any distribu- 
tion with high excess kurtosis, indicating that it is 
highly peaked with heavy tails, or a distribution with 
a delta-mass at zero. The set of sparsity-favouring 
distributions includes the Normal-Gamma, Normal 
Inverse-Gaussian, Laplace (or double Exponential), 
Exponential, or generally the class of scale-mixtures 
of Gaussian distributions (iPolson and Scottl 120101). 



Distributions that encourage sparsity fall into two 
classes: continuous sparsity-favouring or spike-and- 
slab distributions, which give rise to notions of weak 
and strong sparsity, respectively: 

Weak sparsity. A parameter vector u> is consid- 
ered to be 'weakly sparse' if none of its elements are 
exactly zero, but has most elements close to zero 
with a few large entries. This implies that a weakly 
sparse vector u> has a small L p norm for small p, or 
has entries that decay in absolute value according to 
some power law (Johnstone and Silverman 2004). 



Strong sparsity. A parameter vector u> is con- 
sidered to be 'strongly sparse' if elements of u; are 
exactly zero. The spike-and-slab prior places mass 
explicitly on zero and is thus a prior suited to achiev- 
ing this notion of sparsity in learning. 

3. Strongly Sparse Bayesian Models 

A Bayesian approach to learning averages model 
parameters and variables according to their posterior 
probability distribution given the data, rather than 
searching for a single best parameter setting as in an 
optimisation approach. To obtain Bayesian models 
with strong sparsity, we use a spike-and-slab prior 



[2005): a discrete-continuous mixture of a point mass 
at zero referred to as the 'spike' and any other distri- 
bution known as the 'slab'. This slab distribution is 
most often a uniform or Gaussian distribution, but 
may be any appropriate distribution. Since we have 
positive mass on zero, any samples produced include 
exact zeroes, thereby enforcing strong sparsity. The 
spike-and-slab can also be seen as placing a penalty 
on the number of non-zero parameters, and thus 
enforces sparsity in a manner similar to an Lq norm 
penalisation. MCMC allows us to stochastically find 
suitable solutions in this setting, where this is not 
possible otherwise due to the combinatorial nature of 
the optimisation. 

We construct a spike-and-slab prior using a bi- 
nary indicator matrix Z to indicate whether a latent 
dimension contributes to explaining the observed data 
or not. Each observed data point x„ has a corre- 
sponding vector of Bernoulli indicator variables z„. 
The spike components are combined with a Gaussian 
distribution, which forms the slab component: 



Znkl^k) 



n 



(1-TTfc) 



p(v„|z„,/X, S) = JJ N(v nk \z n kllk, ZnfcOfc)' 



(3) 



(4) 



where Af represents the Gaussian density with mean 
/ifc and variance a\. We place a Beta prior /3(iTk\e, /) 
on the Bernoulli parameters 7tk- When z„k = , 
p{vnk) in equation [4] becomes a S- function at zero, 
indicating the spike being chosen instead of the slab. 
We complete the model specification by using a 
Gaussian-Gamma prior for the unknown mean /Zfc and 
variance a\ . We denote the set of unknown variables 
to be inferred as f2 = {Z, V, 0, 7r, fi, X} and the set 
of hyperparameters \P = {e, /, A, v\. 

MCMC Sampling Scheme 

Since the spike-and-slab is not differentiable, many 
popular MCMC techniques, such as Hybrid Monte 
Carlo, are not applicable. We proceed in the context 
of Metropolis- within- Gibbs sampling, where we 
sequentially sample each of the unknown variables 
using Metropolis-Hastings. Our sampling procedure 
iterates through the following steps : 1) Sample Z 



and V jointly; 2) Sample by slice sampling (Neal 
2003); 3) Sample /x, X and tz by Gibbs sampling. 



In sampling the latent factors z n k and v n k in 
step 1, we first decide whether a latent factor 
contributes to the data or not by sampling z n k 
having integrated out v nk : p(z n k = 0|X, 7r,V-,„fe) and 
p{z n k — 1 l-X", 7T,V^„fe), where 'V^ nk are current values 
of V, with v n k excluded. Based on this decision, the 
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latent variable is sampled from the spike or the slab 
component. All variables v nk associated with the 
slab components are sampled using slice sampling. 
Evaluating these probabilities involves computing the 
following integrals: 

P (, nfe =0|X )7 r ! V^)=/K^=0,« nfe =0|Xy.„ fc , 7 r)^ 
= (l-ir k )p(X\V^ nk ,v nk =0,®) (5) 

p(z nk =l\Xy^ n k)=jp(z n k=l,v n k\Xy^ n k,7r)dv n k 
= ir k Jp(X\V , ®)AT(v nk \fi k , al)dv nk (6) 

While computing equation [5] is easy, the integral in 
equation [6] is not tractable in general. In the case of 
the Gaussian family, v nk can be marginalised and we 
do exactly this. For other families the integral must be 
approximated. A number of approximation methods 
exist such as Monte Carlo integration, importance 
sampling and pseudo-marginal approaches, and the 
Laplace approximation, which we use here. The use of 
Laplace's method introduces a bias due to the approx- 
imation of the target distribution. This problem has 
been studied by |Guihenneuc-Jouyaux and Rousseau| 
(2005) where the Laplace approximation is used in 



MCMC schemes with latent variables such as in our 
case, and show that such an approach can behave 



well. Guihenneuc-Jouyaux and Rousseau (20051 show 



that as the number of observations increases, the 
approximate distribution becomes close to the true 
distribution, and describe a number of assumptions 
for this to hold, such as requiring differentiability, a 
positive definite information matrix and conditions 
on the behaviour of the prior at boundaries of the 
parameter space. 

At least three other approaches for sampling the 
latent variables can be considered: 1) A more naive 
sampling of alternating between V and Z without 
integrating out the slab. 2) Sampling V after inte- 
grating Z. We found the collapsed scheme we describe 
in eq ([5|-([6]) quickly informs us of the state of the slab 
overall and resulted in faster mixing. 3) Reversible 
jump MCMC is also feasible and requires a different 
prior specification, also using a binary indicator vector 
but with a prior on the number of non-zero latent 
variables (e.g., using a Poisson). 

We sample V and in steps 1 and 2 by slice 



value for the parameter from an interval along the 
slice. The variables {fi, £} and 77 in step 3 have 
conjugate relationships with the latent variables V 
and Z respectively. Gibbs sampling is used since the 
full conditional distributions are easily derived^ 

4. Models with L\ norms and 
Sparsity-Favouring Priors 

The L\ norm has become the established mechanism 
with which to encode sparsity into many problems, 
and has a strong connection to continuous densities 
that promote sparsity. The L\ norm has a number 
of appealing properties: it gives the closest convex 
optimisation problem to the Lo problem; there is 
an broad theoretical basis with provable properties 
(L a — Li equivalence and exact recovery based on 
RIP); and can be implemented efficiently based on the 
tools of convex optimisation (linear and semi-definite 
programming) . 

Sparsity Inducing Loss Functions 

This leads us naturally to consider sparse latent 
variable models based on the L\ norm. If we 
assume that the latent distribution is a Laplace, 
<S(v„) oc exp(— a|jv„||i), the maximum a posteriori 
solution for V is equivalent to L\ norm regularisation 
in this model. We define the following objective for 
sparse generalised latent variable modelling: 

minV £(x n) v n 0)+ a|| V||i +/?#(©), (7) 
V,& *■ — <n 

where the loss function £ (x n , v„0) = - lnp(x„|v„0), 
is the negative log likelihood obtained using equation 
[2j Equation[7]provides a unifying framework for sparse 
models with L\ regularisation. The regularisation 
parameters a and /3, control the sparsity of the latent 
variables and the degree to which parameters will be 
penalised during learning. The function i?(0) is the 
regulariser for the model parameters 0. This model 
is specified generally and applicable for a wide choice 
of regularisation functions -R(-), including the L\ 
no rm. Such a loss function was described previously 
by Lee et al. (2009) - here we focus on unsupervised 



sampling (Neal 2003), which can be thought of as 
a general version of the Gibbs sampler. Sampling 
proceeds by alternately sampling an auxiliary variable 
it, the slice level, and then randomly drawing a 



settings and specify the loss more generally, allowing 
for both sparse activations as well as basis functions. 
One configuration we consider is the use of the 
modified loss (7]) with R(&) = - lnp(0| A, v). This 
loss allows sparsity in the latent variables and corre- 
sponds to finding the maximum a posteriori (MAP) 
solution. We shall refer to this model as the L\ model. 

Optimisation is performed by alternating min- 
imisation. Each step then reduces to established 

1 Implementation notes online at: cs .ubc . ca/~shakirm 
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problems for which, we can then rely on the extensive 
literature regarding Li norm minimisation. A number 
of methods exist to solve these problems: they can 
be recast as equivalent inequality constrained optimi- 
sation problems and solved using a modified LARS 



algorithm (Lee et al. 2006), recast as a second order 



cone program, or solved using a number of smooth 
approximations to the regularisation term (Schmidt 
et al. 2007), amongst others. 



Sparse Bayesian Learning 

Continuous densities with high excess kurtosis such 
as the zero- mean Laplace distribution or Student 's-t 
distribution are often used in Bayesian models where 
sparsity is desired. For a model with priors that 
prefer sparsity, the Bayesian averaging process often 
results in non-sparse posteriors and give solutions that 
are nearly zero, resulting in weakly sparse models. 
We consider two models with sparsity in the latent 
variables v n : 

Laplace Model. Using the Laplace distribution: 

v « ~ IlfcLi h bk ex P (-bk\v n k\), a Bayesian version of 
the Li model described by equation [7] can be spec- 
ified. The equivalence between this model and the 
L\ model can be seen by comparing the log-joint 
probability using the Laplace distribution, to the ~L\ 
loss of equation [7J We refer to Bayesian inference in 
this Laplace model as LXPCA, in contrast to the L\ 
model, which is an optimisation-based method. 
Exponential Model. If parameters or latent vari- 
ables are to be positively constrained, the natural 
choice would be an exponential distribution peaked 
at zero: v„ ~ IlfcLi ex P (—bkV n k), which has sim- 
ilar shrinkage properties to the Laplace. We refer to 
this model as NXPCA. 

These distributions are popular in sparse regression 



problems (Seeger. et al. 2007 Wipf and Nagarajan 



2008) and are natural candidates in the unsupervised 
models explored here. The hierarchical model spec- 
ification is completed by placing a Gamma prior on 
the unknown rate parameters b, with shared shape 
and scale parameters a and j3 respectively. We de- 
note the set of unknown variables to be inferred as 
fi = {V, 0, b} and the set of hyperparameters \I r = 
{a, (3, A, v). The joint probability of the model is: 



p(X, n|¥) =p(X|V, 0)p(0| A, i/)p(V|b)p(b|a, 0) (8) 

Inference in this model is accomplished using Markov 
Chain Monte Carlo (MCMC) methods, and the log of 
the joint probability Q is central to this sampling. 
We use a sampling approach based on Hybrid Monte 
Carlo (HMC). This can be implemented easily, and we 
defer the algorithmic details to |MacKay| ( |2003[ ) . 



5. Related Work 

The body of related work is broad and the work 
described here is far from exhaustive, but attempts to 
capture many papers of relevance in contextualising 
approaches to, and applications of sparse learning. 
There is a wide body of literature for sparse learning 
in problems of feature selection, compressed sensing 
and regression using the L\ norm, such as those by 



Tibshirani (19961; d'Aspremont et al. (2005); Candes 



( 2006 ) ; Lee et al. ( 2006 1 . Bayesian methods for sparse 
regression problems using continuous distributions 
have also been discussed by |Seeger. et aT] (|2007|); 



Carvalho et al. (20101; O'Hara and Sillanpaa (20091 



Wipf and Nagarajan (2008) derive a relationship 



between automatic relevance determination (ARD), 
maximum likelihood and iterative L\ optimization. 



Archambeau and Bach (2009) provide a nice explo- 



ration of ARD-related priors and variational EM for 
sparse PCA and sparse CCA. 

Of relevance to unsupervised learning of real- 



valued data is sparse PCA and its variants ( 


Zou et al. 


2004 


d'Aspremont et al. 


2005 


Rattray et al. 


2009). 



The wide body of literature on matrix factorisation 



is also indirectly related (Airoldi et al. 2008). These 



methods do not deal with the exponential family 
generalisation and may yield sparse factors as a 
by-product, rather than by construction. There are 
also many other papers of relevance in bioinformatics, 



computer vision, ICA and blind deconvolution (Levin 



et al. 2009 1 . The methods we develop here also have 
a strong bearing on the basis pursuit problem widely 
used in geophysics and other engineering fields and 
can allow not only for the solution of basis pursuit, 
but also in obtaining useful estimates of uncertainty. 

The use of 'spike-and-slab' sparsity for variable 
selection was established in statistics by |Mitchell 



and Beauchamp 



waran and Rao 



(1988) an d mo r e rec ently by Ish- 
(20051. lYenl ([Mil describes a 



majorisation-minimisation algorithm for MAP esti- 
mation, and Liicke and Sheikh (2012) describe EM 



for Gaussian sparse coding. Carvalho et al. (2008) 
use spike-and-slab-type priors to introduce sparsity 
in Bayesian factor regression models. They consider 
a hierarchical sparsity prior to reduce uncertainty 
as to whether a parameter is non-zero. This comes 
with increased computation and may not necessarily 



improve performance. Courville et al. (2010) describe 



spike-and-slab for deep belief networks. 



6. Experimental Results 

We consider the generalisation performance of unsu- 
pervised methods to unseen data, which appear as 
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Table 1. Summary of real data used. 



# 


Data 


N 


D 


Type 


1 

2 
3 
4 


Natural scones 
Animal attributes 
Newsgroups 
Hapmap 


10,000 
33 
100 
100 


144 
102 
200 
200 


Real 
Binary 
Counts 
Binary 



missing data. To handle missing data, we divide 
the data into a set of observed and missing data, 
X = {x obs ,X"" ssm f} and condition on the set X obs 
in the inference. We create test sets by randomly se- 
lecting 10% of the elements of the data matrix. Test 
elements are set as missing values in the training data, 
and our learning algorithms have been designed in all 
cases to handle missing data. We calculate the predic- 
tive probability (negative log probability, NLP) and 
the root mean squared error (RMSE) using the test- 
ing data. We created 20 such data sets, each with a 
different set of missing data, and provide mean and one 
standard deviation error bars for each of our evaluation 
metrics. For fairness, the regularisation parameters a 
and (3 in section [4] are chosen by cross-validation using 
a validation data set, which is chosen as 5% of the data 
elements. This set is independent of the data that has 
been set aside as training or testing data. 

6.1. Benchmark Data 



We use the block images data (Griffiths and Ghahra- 



mam 



2006) as a synthetic benchmark data set. The 
data consists of binary images, with each image x„ 
represented as a 36-dimensional vector. The images 
were generated with four latent features, each being a 
type of block. The observed data is a combination of 
a number of these latent features. Noise is added by 
flipping bits in the images with probability 0.1. This 
data set consists of a number of latent factors, only a 
subset of which contributes to explaining any single 
data point. This data is synthetic, but not generated 
from any of the models tested. 



Figure |2(a)| shows the NLP and RMSE on this 
benchmark data set. The methods developed are 



compared to EPCA (Collins et al. 2002), BXPCA 



(Mohamed et al. 20081 and to binary ICA (Kaban 



and Bingham 2006). A random predictor wou 

360 bits 



d have 
The models 



an NLP = 100 x 36 x 10% 
tested here have performance significantly better than 
this. Both optimisation-based and Bayesian learning 
approaches do well, but the spike-and-slab model 
shows the best performance with smaller error bars. 

6.2. Real Data 

We summarise the real data sets we use in table 
[l] (which includes data in the D > N regime). 
Natural images are the topic of much research 
based on L x regularisation. For the Olshauscn and| 





Figure 3. RMSE and NLP for the aimal attributes data. 



Field (1996) image data set, we use 12 x 12 image 
patches extracted from a set of larger images. We use 
the Gaussian instantiation of the sparse generalised 
model (equation [2| and evaluate the performance of: 
Li optimisation; a Laplace-prior factor model; and 
the Bayesian spike-and-slab model. Our results are 
shown for both underdetermined and overcomplete 



bases (K = 192 as in Olshausen and Field (1996)) 
in figure 2(b) All methods perform similarly in 



the low-rank approximation cases, but as the model 
becomes overcomplete, Bayesian methods perform 
better with the spike-and-slab method much better 
than other methods, particularly in reconstructing 
held-out/missing data. The animal attributes data 
set of Kemp and Tenenbaumj ([2008 ) consists of animal 
species with ecological and biological properties as 
features. We use the binary unsupervised model and 
show results for various latent dimensions for NLP 
and RMSE in figure [3} For this data, the NLP of a 
random classifier is 336 bits and the models have NLP 
values much lower than this. 

We also use a subset of the popular 20 news- 
groups data set, consisting of documents and counts 
of the words used in each document, with data 
sparsity of 93%. Figure |4(b)| shows the performance 
of the Poisson unsupervised model using L\ and 
spike-and-slab. Apart from the application of the 
model to count data, the results show that the 
spike-and-slab model is able to deal effectively with 
the sparse data and provides effective reconstructions 
and good predictive performance on held out data. 
We are also able to show the improved behaviour of 
the spike-and-slab model using the Hapmap data 



3_f 



4(c 



The comparative performance is shown in figure 
showing the spike-and-slab has performance 
similar to L\ in terms of RMSE at low K, but much 
better performance for large K. 

7. Discussion and Conclusion 

The common lore when using MCMC is that it is 
dramatically slower than optimisation methods. For 

2 Obtained from: https:/ /mathgen. stats. ox. ac.uk/impute/ 
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A 4 



I | S&Sfixed 



K=4 K=5 K=8 K=10 



III 



K=5 K=6 K=8 K=10 



K=5 K=1D K=5D K=100 K=200 




K 


ii 


Spike-Slab 


5 


475 ±36 


1446 ±24 


6 


483±57 


1418± 29 


8 
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Figure 4. (a) - (c) Comparison of predictive probabilities (NLP). 'S&S fixed' is the time- matched spike-and-slab perfor- 
mance (elaborated upon in sect. [7f . (d) Num. of non-zeros in newsgroups reconstruction - the true number is 1436. 



optimisation methods, the cross-validation procedure 
needed to set regularisation parameters a and j3, is 
computationally demanding due to the need to exe- 
cute the optimisation for many combinations of pa- 
rameters. This approach is also wasteful of data, since 
a separate validation data set is needed to make sen- 
sible choices of these values and to avoid model over- 
fitting. While individual optimisations may be quick, 
the overall procedure can take an extended time, which 
depends on the granularity of the grid over which reg- 
ularisation values are searched for. These parameters 
can be learnt in the Bayesian setting and have the ad- 
vantage that we obtain information about the distribu- 
tion of our latent variables, rather than point estimates 
and can have significantly better performance. 

Figure [4] demonstrates this tradeoff between running 
time and performance of the optimisation and the 
Bayesian approaches. L\ was allowed to run to conver- 
gence and the spike-and-slab for 200 iterations. In this 
instance, the Bayesian method is seemingly slower, but 
produced significantly better reconstructions in both 
the human judgements and newsgroups data. We con- 
sidered the setting where we have a fixed time budget 
and fixed the running time for the spike-and-slab to 
that used by the L\ model (including time to search 
for hyperparameters). The result is shown (as S&S 
fixed) in figure |4j which shows that even with a fixed 
time budget, MCMC performs better in this setting. 
The table of figure |4(d)| shows that the number of 
non-zeroes in the reconstructions for various K for the 



newsgroups data, with the true number of non-zeroes 
being 1436. L\ is poor in learning the structure of this 
sparse data set, whereas the spike-and-slab is robust 
to the data sparsity. 

All our results showed the spike-and-slab approach 
to have better performance than other methods com- 
pared in the same model class. The models based on 
the L\ norm or Bayesian models with continuous spar- 
sity favouring priors enforce global shrinkage on pa- 
rameters of the model. It is this property that induces 
the sparsity property, but which also results in the 
shrinkage of parameters of relevance to the data. This 
can be problematic in certain cases, such as the news- 
groups data set which resulted in overly sparse data 
reconstructions. The spike-and-slab has the ability 
to give both global and local shrinkage, thus allowing 
sparsity in the model parameters while not restricting 
parameters that contribute to explaining the data. 

Current approaches for sparse learning will have dif- 
ficulty scaling to large data sets in this regime. We 
might think of EP as a potential solution, such as 
used by Hernandez Lobato et al. (20101, but this is 



restricted to regression problems. For the standard 



Gaussian model, Rattray et al. (2009) discuss this is- 



sue and propose a hybrid VB-EP approach as one way 
of achieving fast inference, but such an approach is not 
ideal, leaving scope for future work. 

We have demonstrated that improved performance 
can be obtained by considering sparse Bayesian ap- 
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proaches. In particular, Bayesian learning with spike- 
and-slab priors consistently showed the best perfor- 
mance on held out data and produced accurate re- 
constructions, even in the 'large p' paradigm or with 
restricted running times. By considering the broad 
family of unsupervised latent variable models, we de- 
veloped a sparse generalised model and provided new 
sampling methods for sparse Bayesian learning using 
the spike-and-slab distribution. Importantly, we have 
provided the first comparison of sparse unsupervised 
learning using three approaches: optimisation using 
the L\ norm, Bayesian learning using continuous spar- 
sity favouring priors, and Bayesian learning using the 
spike-and-slab prior. We have also demonstrated our 
methods in diverse applications including text mod- 
elling, image coding and psychology showing the flex- 
ibility of the sparse models developed. These results 
show that Bayesian sparsity and spike-and-slab meth- 
ods warrant a more prominent role and wider use in 
sparse modelling applications. 
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